Emotion recognition based on physiological signals is affected by noise and other factors, which leads to low accuracy and weak generalization across individuals. To address this issue, a multimodal emotion recognition method based on electroencephalogram (EEG), electrocardiogram (ECG), and eye movement signals was proposed. Firstly, multi-scale convolution was applied to the physiological signals to obtain higher-dimensional signal features while reducing the number of parameters. Secondly, self-attention was employed in the fusion of the multimodal signal features to increase the weights of key features and reduce feature interference between modalities. Finally, a Bi-directional Long Short-Term Memory (Bi-LSTM) network was used to extract temporal information from the fused features and to perform classification. Experimental results show that the proposed method achieves recognition accuracies of 90.29%, 91.38%, and 83.53% on the valence, arousal, and valence/arousal four-class recognition tasks, respectively, which are improvements of 3.46 to 7.11 and 0.92 to 3.15 percentage points over the EEG single-modality and EEG+ECG bimodal methods, respectively. The proposed method recognizes emotion accurately and with more stable recognition performance across individuals.
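
The self-attention fusion step can be illustrated with a minimal sketch. It is not the implementation used in this work: it assumes one already-extracted feature vector per modality, uses the raw features directly as queries, keys, and values (no learned projections), and the function name `self_attention_fuse` and the feature values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_fuse(features):
    """Fuse per-modality feature vectors with scaled dot-product self-attention.

    features: list of equal-length vectors, one per modality.
    Assumption: queries, keys, and values are the raw features themselves;
    a trained model would apply learned linear projections first.
    Returns (fused, weights), where fused[i] is the attention-weighted
    mixture of all modality vectors as seen from modality i.
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    fused, weights = [], []
    for query in features:
        # Similarity of this modality to every modality (itself included).
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in features]
        row = softmax(scores)
        weights.append(row)
        # Weighted sum of the value vectors, one output dimension at a time.
        fused.append([sum(row[j] * features[j][t] for j in range(len(features)))
                      for t in range(dim)])
    return fused, weights


# Hypothetical 4-dimensional features for the three modalities.
eeg = [0.9, 0.1, 0.4, 0.3]
ecg = [0.2, 0.8, 0.1, 0.5]
eye = [0.7, 0.2, 0.5, 0.1]
fused, weights = self_attention_fuse([eeg, ecg, eye])
```

Each row of `weights` sums to 1 and indicates how strongly one modality attends to the others. In the described pipeline the fused features would then be passed to the Bi-LSTM for temporal modeling and classification.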